1 Where Do People Drink The Most Beer, Wine And Spirits?

Back in 2014, fivethirtyeight.com published an article on alcohol consumption in different countries. The drinks data is available as part of the fivethirtyeight package. Make sure you have installed the fivethirtyeight package before proceeding.

library(fivethirtyeight)
data(drinks)


# or download directly
# alcohol_direct <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv")

What are the variable types? Any missing values we should worry about?

skimr::skim(drinks)
Data summary
Name drinks
Number of rows 193
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 28 0 193 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
beer_servings 0 1 106.16 101.14 0 20.0 76.0 188.0 376.0 ▇▃▂▂▁
spirit_servings 0 1 80.99 88.28 0 4.0 56.0 128.0 438.0 ▇▃▂▁▁
wine_servings 0 1 49.45 79.70 0 1.0 8.0 59.0 370.0 ▇▁▁▁▁
total_litres_of_pure_alcohol 0 1 4.72 3.77 0 1.3 4.2 7.2 14.4 ▇▃▅▃▁
alcohol_data <- drinks

TEAM ANSWER:

There are five variables: four numeric (beer_servings, spirit_servings, wine_servings, total_litres_of_pure_alcohol) and one character (country). There are no missing values to worry about.

Make a plot that shows the top 25 beer consuming countries

top25_beer <- drinks %>%
  slice_max(order_by = beer_servings, n = 25)

ggplot(top25_beer,aes(y = reorder(country, beer_servings), x = beer_servings))+
  geom_col()+
  labs(x = "Beer Servings",
       y = "Country",
       title = "Top 25 Beer Consuming Countries")+
  theme_bw()+
  NULL

Make a plot that shows the top 25 wine consuming countries

# YOUR CODE GOES HERE

top25_wine <- drinks %>%
  slice_max(order_by = wine_servings, n = 25)

ggplot(top25_wine, aes(y = reorder(country, wine_servings), x = wine_servings ))+
  geom_col()+
  labs(x = "Wine Servings",
       y = "Country",
       title = "Top 25 Wine Consuming Countries")+
  theme_bw()+
  NULL

Finally, make a plot that shows the top 25 spirit consuming countries

# YOUR CODE GOES HERE

top25_spirit <- drinks %>%
  slice_max(order_by = spirit_servings, n = 25)

ggplot(top25_spirit, aes(y = reorder(country, spirit_servings), x = spirit_servings ))+
  geom_col()+
  labs(x = "Spirit Servings",
       y = "Country",
       title = "Top 25 Spirit Consuming Countries")+
  theme_bw()+
  NULL

What can you infer from these plots? Don’t just explain what’s in the graph, but speculate or tell a short story (1-2 paragraphs max).

TEAM ANSWER:

The countries at the top of each list tend to be large producers of the given alcoholic beverage. A historic national connection to the drink, and national pride in its production, seem to influence the choice of alcohol consumed. For example, in Namibia the brewing industry is regarded as a source of national pride.

2 Analysis of movies- IMDB dataset

We will look at a sample of movies taken from the Kaggle IMDB 5000 movie dataset

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …

Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:

  • gross : The gross earnings in the US box office, not adjusted for inflation
  • budget: The movie’s budget
  • cast_facebook_likes: the number of Facebook likes cast members received
  • votes: the number of people who voted for (or rated) the movie in IMDB
  • reviews: the number of reviews for that movie
  • rating: IMDB average rating

2.1 Use your data import, inspection, and cleaning skills to answer the following:

  • Are there any missing values (NAs)? Are all entries distinct or are there duplicate entries?
  • Produce a table with the count of movies by genre, ranked in descending order
  • Produce a table with the average gross earnings and budget (gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ a movie made at the box office for each $ of its budget. Rank genres by this return_on_budget in descending order
  • Produce a table that shows the top 15 directors who have created the highest gross revenue in the box office. Don’t just show the total gross amount, but also the mean, median, and standard deviation per director.
  • Finally, ratings. Produce a table that describes how ratings are distributed by genre. We don’t want just the mean, but also, min, max, median, SD and some kind of a histogram or density graph that visually shows how ratings are distributed.
skimr::skim(movies) # check summary of the dataset
Data summary
Name movies
Number of rows 2961
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1 1 83 0 2907 0
genre 0 1 5 11 0 17 0
director 0 1 3 32 0 1366 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2.00e+03 9.95e+00 1920.0 2.00e+03 2.00e+03 2.01e+03 2.02e+03 ▁▁▁▂▇
duration 0 1 1.10e+02 2.22e+01 37.0 9.50e+01 1.06e+02 1.19e+02 3.30e+02 ▃▇▁▁▁
gross 0 1 5.81e+07 7.25e+07 703.0 1.23e+07 3.47e+07 7.56e+07 7.61e+08 ▇▁▁▁▁
budget 0 1 4.06e+07 4.37e+07 218.0 1.10e+07 2.60e+07 5.50e+07 3.00e+08 ▇▂▁▁▁
cast_facebook_likes 0 1 1.24e+04 2.05e+04 0.0 2.24e+03 4.60e+03 1.69e+04 6.57e+05 ▇▁▁▁▁
votes 0 1 1.09e+05 1.58e+05 5.0 1.99e+04 5.57e+04 1.33e+05 1.69e+06 ▇▁▁▁▁
reviews 0 1 5.03e+02 4.94e+02 2.0 1.99e+02 3.64e+02 6.31e+02 5.31e+03 ▇▁▁▁▁
rating 0 1 6.39e+00 1.05e+00 1.6 5.80e+00 6.50e+00 7.10e+00 9.30e+00 ▁▁▆▇▁
unique(movies) # check for duplicate rows
## # A tibble: 2,961 x 11
##    title    genre  director  year duration  gross budget cast_facebook_l…  votes
##    <chr>    <chr>  <chr>    <dbl>    <dbl>  <dbl>  <dbl>            <dbl>  <dbl>
##  1 Avatar   Action James C…  2009      178 7.61e8 2.37e8             4834 8.86e5
##  2 Titanic  Drama  James C…  1997      194 6.59e8 2   e8            45223 7.93e5
##  3 Jurassi… Action Colin T…  2015      124 6.52e8 1.5 e8             8458 4.18e5
##  4 The Ave… Action Joss Wh…  2012      173 6.23e8 2.2 e8            87697 9.95e5
##  5 The Dar… Action Christo…  2008      152 5.33e8 1.85e8            57802 1.68e6
##  6 Star Wa… Action George …  1999      136 4.75e8 1.15e8            37723 5.35e5
##  7 Star Wa… Action George …  1977      125 4.61e8 1.1 e7            13485 9.11e5
##  8 Avenger… Action Joss Wh…  2015      141 4.59e8 2.5 e8            92000 4.63e5
##  9 The Dar… Action Christo…  2012      164 4.48e8 2.5 e8           106759 1.14e6
## 10 Shrek 2  Adven… Andrew …  2004       93 4.36e8 1.5 e8             1148 3.15e5
## # … with 2,951 more rows, and 2 more variables: reviews <dbl>, rating <dbl>

TEAM ANSWER TO: Are there any missing values (NAs)? Are all entries distinct or are there duplicate entries?

There are no missing values and no duplicate rows; however, some titles appear multiple times, which is indicative of movie remakes.
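The remake claim can be checked directly by counting repeated titles. A minimal sketch with a toy data frame (the same dplyr calls apply unchanged to movies):

```r
library(dplyr)

# Toy stand-in for the movies data: one title appears twice (a remake)
films <- tibble(
  title = c("King Kong", "King Kong", "Titanic"),
  year  = c(1976, 2005, 1997)
)

# Titles appearing more than once are candidate remakes
repeated_titles <- films %>%
  count(title, sort = TRUE) %>%
  filter(n > 1)

repeated_titles
```

In movies, the 2,961 rows contain only 2,907 unique titles (see the skim output above), consistent with a few dozen repeated titles.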

# number of movies by genre in descending order
movies_by_genre <- movies %>%
  select(genre)%>%
  count(genre, sort=TRUE, name = "NumOfMovies")
  
# return_on_budget: $ earned at the box office per $ of budget, by genre
average_gross <- movies %>%
  group_by(genre) %>%
  summarise(avg_budget = mean(budget),
            avg_gross = mean(gross)) %>%
  mutate(return_on_budget = avg_gross / avg_budget) %>%
  arrange(desc(return_on_budget))

# top15 director by gross
top15_gDirectors <- movies%>%
  group_by(director)%>%
  summarise(total_gross = sum(gross),
         avg_gross = mean(gross),
         std_gross = sd(gross),
         median_gross = median(gross))%>%
  slice_max(order_by = total_gross, n = 15)
  
#rating by Genre
ratingByGenre <- movies%>%
  group_by(genre)%>%
  summarise(mean_rating = mean(rating),
            max_rating = max(rating),
            min_rating = min(rating),
            sd_rating = sd(rating),
            median_rating = median(rating))


ggplot(movies, aes(y = rating, x = reorder(genre,-rating)))+
  geom_boxplot()+
  labs(x = "Genre",
       y = "Rating",
       title = "Ratings by Genre")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust = 0.8))+
  NULL

2.2 Use ggplot to answer the following

  • Examine the relationship between gross and cast_facebook_likes. Produce a scatterplot and write one sentence discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office. What variable are you going to map to the Y- and X- axes?
ggplot(movies, aes(x = cast_facebook_likes, y = gross))+
  geom_point()+
  scale_x_log10()+
  scale_y_log10()+
  geom_smooth(method = "lm")+
  labs(x = "Number of Facebook Likes of the Cast Members",
       y = "Gross Box Office Revenue",
       title = "Relationship between Cast Facebook Likes and Gross Box Office Revenue")+
  theme_bw()+
  NULL

  sprintf("The correlation between cast facebook likes and gross is: %f",cor(movies$cast_facebook_likes, movies$gross))
## [1] "The correlation between cast facebook likes and gross is: 0.213161"

TEAM ANSWER:

The relationship between gross and cast_facebook_likes is positive but weak (correlation 0.213), so the cast's Facebook likes are not a good predictor of box office gross.

  • Examine the relationship between gross and budget. Produce a scatterplot and write one sentence discussing whether budget is likely to be a good predictor of how much money a movie will make at the box office.
ggplot(movies, aes(x=budget, y = gross))+
  geom_point()+
  geom_smooth(method = "lm")+
  scale_x_log10()+
  scale_y_log10()+
  labs(x = "Budget of the Movie",
       y = "Box Office Gross Revenue",
       title = "Relationship between Budget and Box Office Revenue")+
  theme_bw()+
  NULL

sprintf("Correlation between budget and gross: %f", cor(movies$budget, movies$gross))
## [1] "Correlation between budget and gross: 0.640834"

TEAM ANSWER:

A movie's budget is more strongly correlated with gross box office revenue (0.641) than the cast's Facebook likes are, making budget the better predictor of gross; still, the correlation remains below 0.7.

  • Examine the relationship between gross and rating. Produce a scatterplot, faceted by genre and discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office. Is there anything strange in this dataset?
ggplot(movies,aes(x = rating, y = gross))+
  geom_point()+
  geom_smooth(method="lm")+
  scale_y_log10()+
  facet_wrap(~genre)+
  labs(x = "Ratings",
       y = "Gross Box Office Revenue",
       title = "Relationship between Ratings and Box Office Revenue by Genre")+
  theme_bw()+
  NULL

TEAM ANSWER:

IMDB ratings are a weak predictor of a movie's box office success: in most facets the best-fit line is almost horizontal. The dataset is also unbalanced, in the sense that Action movies are over-represented while Thriller, Musical, Romance, Western and Family barely have any data points. Using it to learn about tendencies by genre would lead to incorrect statistical conclusions, and it would produce biased machine learning models for predicting movie success.

3 Returns of financial stocks

You may find useful the material on finance data sources.

We will use the tidyquant package to download historical data of stock prices, calculate returns, and examine the distribution of returns.

We must first identify which stocks we want to download data for, and for this we must know their ticker symbol; Apple is known as AAPL, Microsoft as MSFT, McDonald’s as MCD, etc. The file nyse.csv contains 508 stocks listed on the NYSE, their ticker symbol, name, the IPO (Initial Public Offering) year, and the sector and industry the company is in.

nyse <- read_csv(here::here("data","nyse.csv"))
glimpse(nyse)
## Rows: 508
## Columns: 6
## $ symbol        <chr> "MMM", "ABB", "ABT", "ABBV", "ACN", "AAP", "AFL", "A", "…
## $ name          <chr> "3M Company", "ABB Ltd", "Abbott Laboratories", "AbbVie …
## $ ipo_year      <chr> "n/a", "n/a", "n/a", "2012", "2001", "n/a", "n/a", "1999…
## $ sector        <chr> "Health Care", "Consumer Durables", "Health Care", "Heal…
## $ industry      <chr> "Medical/Dental Instruments", "Electrical Products", "Ma…
## $ summary_quote <chr> "https://www.nasdaq.com/symbol/mmm", "https://www.nasdaq…

Based on this dataset, create a table and a bar plot that shows the number of companies per sector, in descending order

# YOUR CODE GOES HERE

companiesPerSector <- nyse%>%
  group_by(sector)%>%
  count(sort = TRUE, name = "numOfComp")
  

ggplot(companiesPerSector, aes(x = numOfComp, y = reorder(sector,numOfComp)))+
  geom_col()+
  labs(x = "Number of Companies",
       y = "Sector",
       title = "Number of Companies in each Sector within the Dataset")+
  theme_bw()+
  NULL

Next, let’s choose some stocks and their ticker symbols and download some data. You MUST choose 6 different stocks from the ones listed below; you should, however, add SPY, which is the S&P 500 ETF (Exchange Traded Fund).

# Notice the cache=TRUE argument in the chunk options. Because getting data is time consuming, 
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd

myStocks <- c("AAPL","DIS","DPZ","ANF","TSLA","SPY" ) %>%
  tq_get(get  = "stock.prices",
         from = "2011-01-01",
         to   = "2021-08-31") %>%
  group_by(symbol) 

glimpse(myStocks) # examine the structure of the resulting data frame
## Rows: 16,098
## Columns: 8
## Groups: symbol [6]
## $ symbol   <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL…
## $ date     <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, …
## $ open     <dbl> 11.6, 11.9, 11.8, 12.0, 11.9, 12.1, 12.3, 12.3, 12.3, 12.4, 1…
## $ high     <dbl> 11.8, 11.9, 11.9, 12.0, 12.0, 12.3, 12.3, 12.3, 12.4, 12.4, 1…
## $ low      <dbl> 11.6, 11.7, 11.8, 11.9, 11.9, 12.0, 12.1, 12.2, 12.3, 12.3, 1…
## $ close    <dbl> 11.8, 11.8, 11.9, 11.9, 12.0, 12.2, 12.2, 12.3, 12.3, 12.4, 1…
## $ volume   <dbl> 4.45e+08, 3.09e+08, 2.56e+08, 3.00e+08, 3.12e+08, 4.49e+08, 4…
## $ adjusted <dbl> 10.1, 10.2, 10.2, 10.2, 10.3, 10.5, 10.5, 10.6, 10.6, 10.7, 1…

Financial performance analysis depends on returns: if I buy a stock today for $100 and sell it tomorrow for $101.75, my one-day return, assuming no transaction costs, is 1.75%. So, given the adjusted closing prices, our first step is to calculate daily and monthly returns.
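The 1.75% example above, worked out in base R; note that the daily series below uses log returns while the monthly and yearly series use arithmetic returns, and the two agree closely for small moves:

```r
# One-day return: buy at $100, sell at $101.75, no transaction costs
p_buy  <- 100
p_sell <- 101.75

simple_return <- p_sell / p_buy - 1   # arithmetic return: 0.0175, i.e. 1.75%
log_return    <- log(p_sell / p_buy)  # log (continuously compounded) return: ~0.01735
```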

#calculate daily returns
myStocks_returns_daily <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "daily", 
               type       = "log",
               col_rename = "daily_returns")

#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "monthly", 
               type       = "arithmetic",
               col_rename = "monthly_returns")

#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
  group_by(symbol) %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "yearly", 
               type       = "arithmetic",
               col_rename = "yearly_returns")

Create a table where you summarise monthly returns for each of the stocks and SPY; min, max, median, mean, SD.

# YOUR CODE GOES HERE

monthlyReturnSummary <- myStocks_returns_monthly%>%
  summarise(monthly_max = max(monthly_returns),
            monthly_min = min(monthly_returns),
            monthly_mean = mean(monthly_returns),
            monthly_median = median(monthly_returns),
            monthly_sd = sd(monthly_returns))

Plot a density plot, using geom_density(), for each of the stocks

# YOUR CODE GOES HERE
ggplot(myStocks_returns_monthly, aes(monthly_returns))+
  geom_density()+
  facet_wrap(~symbol)+
  labs(x = "Monthly Return",
       y = "Density",
       title  = "Distribution of Monthly Returns")+
  theme_bw()+
  NULL

What can you infer from this plot? Which stock is the riskiest? The least risky?

TEAM ANSWER:

Tesla (TSLA) is the riskiest stock: its distribution has the widest spread, i.e. the highest standard deviation of monthly returns. The least risky is SPY, which has the narrowest spread, as one would expect from a diversified index ETF.

Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) in the X-axis. Please use ggrepel::geom_text_repel() to label each stock

# YOUR CODE GOES HERE
library(ggrepel)
ggplot(monthlyReturnSummary, aes(x = monthly_sd, y = monthly_mean))+
  geom_point(size = 4, aes(color = symbol))+
  ggrepel::geom_text_repel(aes(label = symbol), size = 4)+
  labs(x = "Standard Deviation of Monthly Returns",
       y = "Average Monthly Returns",
       title = "Risk vs Potential Return")+
  theme_bw()+
  NULL

What can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?

TEAM ANSWER:

ANF is much riskier than most, while delivering the lowest returns.

4 On your own: IBM HR Analytics

For this task, you will analyse a data set on Human Resource Analytics. The IBM HR Analytics Employee Attrition & Performance data set is a fictional data set created by IBM data scientists. Among other things, the data set includes employees’ income, their distance from work, their position in the company, their level of education, etc. A full description can be found on the website.

First let us load the data

hr_dataset <- read_csv(here::here("data", "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)
## Rows: 1,470
## Columns: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department               <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole                  <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …

I am going to clean the data set, as variable names are in capital letters, some variables are not really necessary, and some variables, e.g., education, are given as a number rather than a more useful description

hr_cleaned <- hr_dataset %>% 
  clean_names() %>% 
  mutate(
    education = case_when(
      education == 1 ~ "Below College",
      education == 2 ~ "College",
      education == 3 ~ "Bachelor",
      education == 4 ~ "Master",
      education == 5 ~ "Doctor"
    ),
    environment_satisfaction = case_when(
      environment_satisfaction == 1 ~ "Low",
      environment_satisfaction == 2 ~ "Medium",
      environment_satisfaction == 3 ~ "High",
      environment_satisfaction == 4 ~ "Very High"
    ),
    job_satisfaction = case_when(
      job_satisfaction == 1 ~ "Low",
      job_satisfaction == 2 ~ "Medium",
      job_satisfaction == 3 ~ "High",
      job_satisfaction == 4 ~ "Very High"
    ),
    performance_rating = case_when(
      performance_rating == 1 ~ "Low",
      performance_rating == 2 ~ "Good",
      performance_rating == 3 ~ "Excellent",
      performance_rating == 4 ~ "Outstanding"
    ),
    work_life_balance = case_when(
      work_life_balance == 1 ~ "Bad",
      work_life_balance == 2 ~ "Good",
      work_life_balance == 3 ~ "Better",
      work_life_balance == 4 ~ "Best"
    )
  ) %>% 
  select(age, attrition, daily_rate, department,
         distance_from_home, education,
         gender, job_role,environment_satisfaction,
         job_satisfaction, marital_status,
         monthly_income, num_companies_worked, percent_salary_hike,
         performance_rating, total_working_years,
         work_life_balance, years_at_company,
         years_since_last_promotion)
# same columns, but satisfaction scores kept as numeric codes for plotting
hr_numeric <- hr_dataset %>%
  clean_names() %>%
  select(age, attrition, daily_rate, department,
         distance_from_home, education,
         gender, job_role,environment_satisfaction,
         job_satisfaction, marital_status,
         monthly_income, num_companies_worked, percent_salary_hike,
         performance_rating, total_working_years,
         work_life_balance, years_at_company,
         years_since_last_promotion)

Produce a one-page summary describing this dataset. Here is a non-exhaustive list of questions:

  1. How often do people leave the company (attrition)
  2. How are age, years_at_company, monthly_income and years_since_last_promotion distributed? can you roughly guess which of these variables is closer to Normal just by looking at summary statistics?
  3. How are job_satisfaction and work_life_balance distributed? Don’t just report counts, but express categories as % of total
  4. Is there any relationship between monthly income and education? Monthly income and gender?
  5. Plot a boxplot of income vs job role. Make sure the highest-paid job roles appear first
  6. Calculate and plot a bar chart of the mean (or median?) income by education level.
  7. Plot the distribution of income by education level. Use a facet_wrap and a theme from ggthemes
  8. Plot income vs age, faceted by job_role
# job satisfaction by job role
hr_numeric%>%
  ggplot(aes(x=job_role, y = job_satisfaction))+
  geom_boxplot()+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust = 0.8))+
  labs(title = "Distribution of Job Satisfaction by Job Title",
       x = "Job Title",
       y = "Job Satisfaction")+
  NULL

The job satisfaction of Human Resources employees is noticeably lower than in every other job role, where satisfaction is fairly consistent.

attrition_number <- sum(hr_cleaned$attrition == "Yes")

attrition_frequency <- attrition_number / nrow(hr_cleaned)

Looking at the attrition frequency, about 16% of employees have left the company.

skimr::skim(hr_cleaned)
Data summary
Name hr_cleaned
Number of rows 1470
Number of columns 19
_______________________
Column type frequency:
character 10
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
attrition 0 1 2 3 0 2 0
department 0 1 5 22 0 3 0
education 0 1 6 13 0 5 0
gender 0 1 4 6 0 2 0
job_role 0 1 7 25 0 9 0
environment_satisfaction 0 1 3 9 0 4 0
job_satisfaction 0 1 3 9 0 4 0
marital_status 0 1 6 8 0 3 0
performance_rating 0 1 9 11 0 2 0
work_life_balance 0 1 3 6 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 36.92 9.14 18 30 36 43 60 ▂▇▇▃▂
daily_rate 0 1 802.49 403.51 102 465 802 1157 1499 ▇▇▇▇▇
distance_from_home 0 1 9.19 8.11 1 2 7 14 29 ▇▅▂▂▂
monthly_income 0 1 6502.93 4707.96 1009 2911 4919 8379 19999 ▇▅▂▁▂
num_companies_worked 0 1 2.69 2.50 0 1 2 4 9 ▇▃▂▂▁
percent_salary_hike 0 1 15.21 3.66 11 12 14 18 25 ▇▅▃▂▁
total_working_years 0 1 11.28 7.78 0 6 10 15 40 ▇▇▂▁▁
years_at_company 0 1 7.01 6.13 0 3 5 9 40 ▇▂▁▁▁
years_since_last_promotion 0 1 2.19 3.22 0 0 1 3 15 ▇▁▁▁▁

Looking at the summary statistics in the data summary, the age variable seems closest to normally distributed: its mean (36.9) and median (36) almost coincide, and the quartiles sit roughly symmetrically around them.

The years_at_company variable is right skewed, since the percentiles are not evenly spread: the 25th percentile is 3, the median (p50) is 5 and the 75th percentile is 9, but the maximum (p100) jumps to 40.

monthly_income is also right skewed: the minimum is 1009 and the maximum 19999, while the median (4919) sits well below the mean (6503).

For years_since_last_promotion, the 75th percentile (p75) is 3, meaning 75 percent of employees were promoted within the last 3 years, yet the maximum (p100) is 15, which tells us the variable is very right skewed.
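A quick numeric check of these skew readings: for a right-skewed variable the mean sits above the median. A self-contained sketch with toy values shaped like years_since_last_promotion:

```r
# Right-skewed toy vector: most values small, one long right tail
x <- c(0, 0, 0, 1, 1, 2, 3, 15)

mean(x)              # 2.75, pulled up by the tail
median(x)            # 1
mean(x) > median(x)  # TRUE, the signature of right skew
```

The same pattern holds in hr_cleaned: the mean of years_since_last_promotion (2.19) exceeds its median (1).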

library(ggrepel)
library(forcats)
library(formattable)

hr_cleaned %>% 
  group_by(job_satisfaction) %>% 
  summarise(count=n()) %>% 
  mutate(percent_t = count/sum(count),
         job_satisfaction_n = case_when(
            job_satisfaction == "Low" ~1,
            job_satisfaction == "Medium"~2,
            job_satisfaction == "High"~3,
            job_satisfaction == "Very High"~4
         )) %>% 
  ggplot(aes(x=fct_reorder(job_satisfaction,job_satisfaction_n),y=percent_t))+
  geom_col()+
  geom_text_repel(aes(label= sprintf("%0.2f %%", 100*percent_t)))+
  scale_y_continuous(labels = percent)+
  labs(title="Job Satisfaction",
       x="Job Satisfaction",
       y = "% of employees")+
  theme_bw()+
  NULL

hr_cleaned %>% 
  group_by(work_life_balance) %>% 
  summarise(count= n()) %>% 
  mutate(percent_t = count/sum(count),
         work_life_balance_n = case_when(
          work_life_balance == "Bad" ~1,
          work_life_balance ==  "Good" ~2,
          work_life_balance ==   "Better" ~3,
          work_life_balance ==   "Best"~4
         )) %>% 
  ggplot(aes(x=fct_reorder(work_life_balance,work_life_balance_n),y=percent_t))+
  geom_col()+
  geom_text_repel(aes(label= sprintf("%0.2f %%", 100*percent_t)))+
  labs(title="Work Life Balance",
       x="Work Life Balance",
       y = "% of employees")+
  scale_y_continuous(labels = percent)+
  theme_bw()+
  NULL

job_satisfaction: 61% of employees report high or very high satisfaction with their job, around 19% are merely okay with it, and the remaining 20% report low satisfaction. Job satisfaction appears left skewed.

work_life_balance: this variable is a bit closer to normal than job_satisfaction, although still left skewed. About 10% of employees report the best work-life balance while 5% are unhappy with theirs; roughly 60% report a good ("Better") score and the remaining 23% seem to be doing okay.

hr_cleaned$education <- factor(hr_cleaned$education,levels=c("Below College","College","Bachelor","Master","Doctor"))

ggplot(hr_cleaned,aes(x = education, y = monthly_income))+
  geom_boxplot() +
   labs(x = "Education Level",
       y = "Monthly Income ",
       title  = "Boxplot of Monthly Income by Education Level")+
  theme_bw()+
  NULL

cor(hr_dataset["Education"],hr_dataset["MonthlyIncome"])
##           MonthlyIncome
## Education         0.095

Relationship between monthly income and education:

r-value: 0.095

Looking at the box plot and the r-value, monthly income and education level show a slight positive correlation: the more educated you are, the more you tend to earn. Because the r-value is close to zero we would not call them highly correlated, but the box plots still suggest education level contributes to monthly income.

gender_income <- hr_dataset %>% 
  mutate(binary_gender = case_when(Gender=="Male" ~0 ,
                                   Gender == "Female" ~ 1))%>%
  select(MonthlyIncome,binary_gender)

cor(gender_income["binary_gender"],hr_dataset["MonthlyIncome"])
##               MonthlyIncome
## binary_gender        0.0319
ggplot(hr_dataset,aes(x = Gender, MonthlyIncome))+
  geom_boxplot() +
  labs(x = "Gender",
       y = "Monthly Income ",
       title  = "Boxplot of Monthly Income by Gender")+
  theme_bw()+
  NULL

Relationship between monthly income and gender:

r-value: 0.032. The correlation is slightly positive towards women (with females coded as 1), but it is close enough to zero to be negligible. The box plot of monthly income by gender shows women earning slightly more, but the difference is minimal.

hr_cleaned %>% 
  ggplot(aes(x=fct_reorder(job_role,-monthly_income),y=monthly_income))+
  geom_boxplot()+
  labs(title="Boxplot of Monthly Income Against Job Role", x="Job Role", y="Monthly Income")+
  theme(axis.text.x = element_text(angle = 45, hjust = 0.8))+
  NULL

mean_education <- hr_cleaned %>% 
  group_by(education) %>% 
  summarise(mean=mean(monthly_income))

ggplot(mean_education,aes(x=education,y=mean))+
  geom_col()+
  labs(title = "Education vs Mean Monthly Income",
       x = "Education Level",
       y = "Mean Monthly Income ($)")+
  theme_bw()+
  NULL

median_education <- hr_cleaned %>% 
  group_by(education) %>% 
  summarise(median=median(monthly_income))

ggplot(median_education,aes(x=education,y=median))+
  geom_col()+
  labs(title = "Education vs Median Monthly Income",
       x = "Education Level",
       y = "Median Monthly Income ($)")+
  theme_bw()+
  NULL

ggplot(hr_cleaned,aes(monthly_income))+
  geom_histogram()+
  facet_wrap(~education)+
  labs( x = "Monthly Income", y = "Count",
        title = "Histograms of Monthly Income by Education Level")+
  theme_bw()+
  NULL

hr_cleaned %>% 
  ggplot(aes(y=monthly_income,x=age))+
  geom_point()+
  facet_wrap(~job_role)+
  labs(x="Age", y="Monthly Income",
       title= "Scatter Plots of Age vs Income by Job Role")

5 Challenge 1: Replicating a chart

The purpose of this exercise is to reproduce a plot using your dplyr and ggplot2 skills. Read the article The Racial Factor: There’s 77 Counties Which Are Deep Blue But Also Low-Vaxx. Guess What They Have In Common? and have a look at the attached figure.

You don't have to worry about the blue-red background, and don't worry about replicating the figure exactly; just see how far you can get. You're encouraged to work together and exchange tips and tricks you figured out, and even though the figure in the original article is from early July 2021, you can use the most recent data.

Some hints to get you started:

  1. To get vaccination by county, we will use data from the CDC
  2. You need to get County Presidential Election Returns 2000-2020
  3. Finally, you also need an estimate of the population of each county

TEAM NOTE:

Every data point below 90% completeness was dropped, in line with the author's process. The graph would match even more closely if the separate datasets used by the author were downloaded as well. See the author's description of data utilisation.

vax_complete_pop_pct <- vaccinations%>%
  filter(completeness_pct>90.0)%>%
  group_by(fips)%>%
  summarise(series_complete_pop_pct = max(series_complete_pop_pct))
  

population <- population%>%
  mutate(pop = pop_estimate_2019)

vax_complete_pop_pct <- left_join(x = vax_complete_pop_pct, 
                                  y = population, by = "fips")%>%
  na.omit()


trump_votes <- election2020_results%>%
  filter(candidate == "DONALD J TRUMP")%>%
  filter(mode == "TOTAL")%>%
  group_by(fips)%>%
  mutate(prcOfVote = (candidatevotes/totalvotes)*100)

head(trump_votes)
## # A tibble: 6 x 13
## # Groups:   fips [6]
##    year state  state_po county_name fips  office  candidate party candidatevotes
##   <dbl> <chr>  <chr>    <chr>       <chr> <chr>   <chr>     <chr>          <dbl>
## 1  2020 ALABA… AL       AUTAUGA     01001 PRESID… DONALD J… REPU…          19838
## 2  2020 ALABA… AL       BALDWIN     01003 PRESID… DONALD J… REPU…          83544
## 3  2020 ALABA… AL       BARBOUR     01005 PRESID… DONALD J… REPU…           5622
## 4  2020 ALABA… AL       BIBB        01007 PRESID… DONALD J… REPU…           7525
## 5  2020 ALABA… AL       BLOUNT      01009 PRESID… DONALD J… REPU…          24711
## 6  2020 ALABA… AL       BULLOCK     01011 PRESID… DONALD J… REPU…           1146
## # … with 4 more variables: totalvotes <dbl>, version <dbl>, mode <chr>,
## #   prcOfVote <dbl>
vax_complete_pop_pct <- left_join(x = vax_complete_pop_pct, y = trump_votes, by="fips")%>%
                        na.omit()
library(Hmisc)

ggplot(vax_complete_pop_pct, aes(x = prcOfVote, y = series_complete_pop_pct))+
  geom_point(alpha = 0.2, color = "snow4", aes(size = pop))+
  annotate("rect", xmin = 45, xmax = Inf, ymin = -Inf, ymax = Inf, fill = "indianred1", alpha = 0.3)+
  annotate("rect", xmin = 0, xmax = 55, ymin = -Inf, ymax = Inf, fill = "royalblue1", alpha = 0.3)+
  geom_hline(yintercept = 85, lty = 2, color = "blue")+
  annotate("text", x = 17, y = 87, label = "Herd Immunity threshold (?)", size = 2, fontface = 4, color = "blue")+
  annotate("text", x = 15, y = 15, label = "y = -0.4956x + 0.73669\nR\u00B2 = 0.501", size = 2, color= "red", hjust = "centre", fontface = "bold")+
  geom_hline(yintercept = 51.65, lty = 2, color = "blue")+
  annotate("text", x = 10, y = 54, label =  sprintf("ACTUAL: %0.2f %%",51.65), size =2, fontface = "bold", color = "blue")+
  geom_hline(yintercept = 70, lty = 2, color = "blue")+
  annotate("text", x = 10, y = 72, label =  sprintf("TARGET: %0.2f %%",70), size =2, fontface = "bold", color = "blue")+
  annotate("text", x = 40, y = 15, label = "5/09/2021", color = "red", fontface = "bold", size = 2)+
  scale_size(range = c(.1, 15), name = "Population (M)")+
  geom_point(size= 0.1)+
  geom_smooth(method = "lm", 
              se = FALSE, 
              lty=5, 
              color = "blue", 
              lwd = 0.5)+
  ylab("% of Total Population Vaccinated")+
  xlab("2020 Trump vote %")+
  labs(title = "COVID-19 VACCINATION LEVELS \nOUT OF TOTAL POPULATION BY COUNTY")+
  theme_light()+
  scale_y_continuous(expand = c(0,0), labels = function(y) paste0(y, "%"), breaks = scales::pretty_breaks(n=20), limits = c(0,100))+
  scale_x_continuous(expand = c(0,0), labels = function(x) paste0(x, "%"), breaks = scales::pretty_breaks(n=20), limits = c(0,100))+
  theme(aspect.ratio = 20/18,
        axis.text.x = element_text(size = 5),
        axis.text.y = element_text(size = 5),
        axis.title = element_text(size = 10),
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5))

print(summary(lm(series_complete_pop_pct ~ prcOfVote, data = vax_complete_pop_pct)))
## 
## Call:
## lm(formula = series_complete_pop_pct ~ prcOfVote, data = vax_complete_pop_pct)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31.71  -4.52   0.16   4.76  34.49 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  73.7843     0.7827    94.3   <2e-16 ***
## prcOfVote    -0.4952     0.0119   -41.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.76 on 1721 degrees of freedom
## Multiple R-squared:  0.502,  Adjusted R-squared:  0.502 
## F-statistic: 1.74e+03 on 1 and 1721 DF,  p-value: <2e-16
sprintf("Percentage of the total population vaccinated: %0.2f ",sum(vax_complete_pop_pct$series_complete_pop_pct*vax_complete_pop_pct$pop)/sum(vax_complete_pop_pct$pop))
## [1] "Percentage of the total population vaccinated: 51.78 "
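The `sum(x * w) / sum(w)` expression above is a population-weighted mean; base R's `weighted.mean()` computes the same quantity. A small check on made-up numbers:

```r
# Three hypothetical counties: vaccination % weighted by population.
pct <- c(40, 60, 80)
pop <- c(100, 200, 700)

manual <- sum(pct * pop) / sum(pop)   # the expression used above
wm     <- weighted.mean(pct, w = pop) # equivalent base-R helper
# both equal 72: the large third county pulls the average towards 80
```

Using `weighted.mean()` makes the intent explicit and avoids repeating the two `sum()` calls.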

6 Challenge 2: Opinion polls for the 2021 German elections

The Guardian newspaper has an election poll tracker for the upcoming German election. The list of the opinion polls since Jan 2021 can be found at Wikipedia and your task is to reproduce the graph similar to the one produced by the Guardian.

The following code will scrape the wikipedia page and import the table in a dataframe.

url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor


# get tables that exist on wikipedia page 
tables <- url %>% 
  read_html() %>% 
  html_nodes(css="table")


# parse HTML tables into a dataframe called polls 
# Use purrr::map() to create a list of all tables in the URL
polls <- map(tables, . %>% 
             html_table(fill=TRUE)%>% 
             janitor::clean_names())


# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page contains the list of all opinions polls
  slice(2:(n()-1)) %>%  # drop the first row, which repeats the variable names, and the last row, which contains the 2017 results
  mutate(
         # polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last date, 13 Aug here, as the poll date
         # and we extract it by picking the last 11 characters from that field
         end_date = str_sub(fieldwork_date, -11),
         
         # end_date is still a string, so we convert it into a date object using lubridate::dmy()
         end_date = dmy(end_date),
         
         # we also get the month and week number from the date, if we want to do analysis by month- week, etc.
         month = month(end_date),
         week = isoweek(end_date)
         )
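The date-extraction step can be sanity-checked in isolation. This sketch uses a hypothetical fieldwork string in the "from-to" format the comments describe (it needs stringr and lubridate, both loaded with the tidyverse):

```r
library(stringr)
library(lubridate)

# Hypothetical fieldwork string in the from-to format described above
fieldwork_date <- "9-13 Aug 2021"

end_chr  <- str_sub(fieldwork_date, -11)  # last 11 characters: "13 Aug 2021"
end_date <- dmy(end_chr)                  # parsed into a Date object
isoweek(end_date)                         # ISO week number of the poll
```

Taking the last 11 characters works because the end date is always rendered as a two-digit day, three-letter month, and four-digit year; a single-digit end day would need a more robust extraction (e.g. a regular expression).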
library(plotly)

german_graph <- ggplot(german_election_polls, aes(x=end_date)) +
      geom_smooth(aes(y=union), color="black", size=0.7, span=0.05, se=FALSE)+
      geom_smooth(aes(y=spd), color="darkred", size=0.7, span=0.05, se=FALSE)+     
      geom_smooth(aes(y=grune), color="darkgreen", size=0.7, span=0.105, se=FALSE)+
      geom_smooth(aes(y=fdp), color="yellow", size=0.7, span=0.05, se=FALSE)+
      geom_smooth(aes(y=af_d), color="darkblue",size=0.7, span=0.05, se=FALSE)+
      geom_smooth(aes(y=linke), color="purple", size=0.7, span=0.05, se=FALSE) +
      geom_point(aes(y = union), color = "black", size=0.5, alpha=0.3, show.legend = FALSE) +
       geom_point(aes(y = spd), color = "darkred", size=0.5, alpha=0.3, show.legend = FALSE) +
       geom_point(aes(y = grune), color="darkgreen", size=0.5, alpha=0.3, show.legend = FALSE) +
       geom_point(aes(y= fdp), color="yellow", size=0.5, alpha=0.3, show.legend = FALSE) +
       geom_point(aes(y= af_d), color="darkblue", size=0.5, alpha=0.3, show.legend = FALSE) +
       geom_point(aes(y= linke), color="purple", size=0.5, alpha=0.3, show.legend = FALSE) +
       labs(
            title = "German Election Poll Tracker",
            subtitle = "Who will be the next Chancellor?",
            x = "Date (last updated 3 Sep 2021)",
            y = "Percent of Voter Population")+
    theme_light()

fig <- ggplotly(german_graph + ylim(5, 45))

fig<- fig%>%
  layout(hovermode = "x unified")

fig <- style(fig, hoverinfo = "skip", traces = 7:13)


fig

7 Deliverables

There is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.

8 Details

  • Who did you collaborate with: Members of team 11: Nisa Ozer, Lauren Wade, Nereid Kwok, Thomas Giannetti-Fakhoury, Kazmer Nagy-Betegh
  • Approximately how much time did you spend on this problem set: 16 hours
  • What, if anything, gave you the most trouble: manipulating the vaccine dataset, getting the hang of ggplot

Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack, and remember that I am here to help too!

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?

9 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.